1990-04-13
-- card: 24224 from stack: in.01
-- bmap block id: 0
-- flags: 0000
-- background id: 7910
-- name:
-- part contents for background part 4
----- text -----
/* part 2 of 2 */
EXTENSIONS AND ENHANCEMENTS
As mentioned repeatedly above, some significant yet straightforward
extensions to my current free text IR systems are necessary in order to
properly handle 100 MB/day of data. Here, I will briefly sketch out how
I plan to attack the key problems during the coming months. I assume
that ample physical storage is available to hold the influx of
information online, in a form which allows access to any item in a
fraction of a second.
My systems have to be modified to handle multiple text files as a single
database. I propose to do this by adding a third index file to the
"keys" and "pointers" files -- a "filelist" index file which will simply
contain a list of the database document files along with their lengths.
The structure of the "keys" and "pointers" files will remain unchanged
(which should maximize compatibility with earlier index programs and
minimize the number of new bugs introduced in this step). The index
building programs will treat each of the documents in the "filelist"
file as part of a single big document for indexing purposes, and the
index browsing programs will consult the "filelist" in order to know
where to go to retrieve lines of context or chunks of full text for
display.
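The "filelist" idea can be sketched in C roughly as follows. This is an
illustrative sketch only (the struct and function names are mine, not
from the actual programs): document files are treated as one big
virtual text, and a global byte offset from the "pointers" file is
resolved to a particular file and a local offset within it.

```c
/* Sketch of the proposed "filelist" index (names hypothetical). */
#include <stdio.h>
#include <string.h>

#define MAXFILES 16

struct filelist {
    int  nfiles;
    char name[MAXFILES][64];  /* document file names              */
    long length[MAXFILES];    /* length of each file in bytes     */
    long start[MAXFILES];     /* cumulative offset of file start  */
};

/* add a document file to the list, keeping cumulative offsets */
void fl_add(struct filelist *fl, const char *name, long length)
{
    int i = fl->nfiles++;
    strcpy(fl->name[i], name);
    fl->length[i] = length;
    fl->start[i] = (i == 0) ? 0L : fl->start[i - 1] + fl->length[i - 1];
}

/* resolve a global offset to a file index and local offset;
   returns -1 if the offset lies beyond the whole database */
int fl_resolve(const struct filelist *fl, long global, long *local)
{
    for (int i = fl->nfiles - 1; i >= 0; i--)
        if (global >= fl->start[i]) {
            if (global >= fl->start[i] + fl->length[i]) return -1;
            *local = global - fl->start[i];
            return i;
        }
    return -1;
}
```

The browsing programs would call fl_resolve before fetching context
lines, opening whichever document file the answer names.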
A drawback of this multifile "virtual merge" approach is that it will at
times be necessary to open and close files during browsing operations. I
have not yet run tests, and do not know what penalties the operating
system will impose (one hopes only a few milliseconds?) every time a
file is opened and closed. With the use of modern operating system RAM
caches, I hope that speed will not be a problem. During typical
browsing operations, I believe that most database references will be
predictable and localized, so caching should help average performance a
lot.
Another extension which I plan to implement is to add facilities to
rapidly merge separate index files upon user demand. Merging
already-sorted index files is a very fast operation which should be
limited only by disk I/O rates. It will then be possible to keep
indices for each day's (or hour's, or whatever) collection of data
separate until the time at which a user wants to browse a chosen set of
files. The delay to merge the selected separate indices will be about
equal to the time required for a single sequential scan through the
chosen database. After that start-up delay, searches will progress at
the normal full speed (sub-second response time for simple queries).
Many data collections which are commonly referred to as a unit can have
their merged index files kept online to avoid any search delays.
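Why merging is cheap can be seen from a sketch. Under the simplifying
assumption that each daily index is just a sorted list of postings
(the real index files have more structure), combining two of them
takes a single sequential pass, which is why the cost is essentially
one scan of the chosen data:

```c
/* Sketch: merge two already-sorted posting lists in one pass. */
#include <stddef.h>

/* merge sorted arrays a[na] and b[nb] into out[]; returns count */
size_t merge_postings(const long *a, size_t na,
                      const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

Every input element is read and written exactly once, so the merge is
bound by disk transfer rates rather than by computation.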
My current browser programs can already run on one computer while
searching files resident elsewhere on a network. The generic UNIX
version of my browser can also already be used as a "server" process and
run on one host, sending back only the (relatively) small amounts of
retrieved information that the user wants to see on a local machine. I
plan to rewrite some parts of my browser programs to make their use as
servers simpler and more efficient; this rewrite will take place along
with the other revisions to introduce new features.
Index building itself should not be a computationally infeasible
operation at a 100 MB/day data rate. My indexer programs already run at
15-20 MB/hour on a 16 MHz 68030 (Mac IIcx), and I have had reports of 60
MB/hour or better performance on faster machines with more memory and
higher performance disk drives. I also believe that there is room for a
20% - 50% speed improvement in my indexing algorithms, by applying some
of the standard quicksort enhancements discussed in many textbooks. For
storing the index files, a simple and obvious modification that I plan
to make is to give the user complete freedom to put databases and index
files in any directory, on any volume (online storage device) that is
desired. This will allow archival databases to reside on high-density
optical read-only media, while indices can be built on faster magnetic
media and can be moved to the archive when appropriate.
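A back-of-the-envelope check of the indexing budget above: at the
measured 15-20 MB/hour, 100 MB of new text per day costs roughly 5 to
7 hours of indexing time, comfortably within a day even before any
algorithmic speedups.

```c
/* Trivial throughput arithmetic for the indexing budget. */
double hours_per_day(double mb_per_day, double mb_per_hour)
{
    return mb_per_day / mb_per_hour;
}
```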
To handle databases larger than about 4 GB (2^32) will require
modifications to my programs, but (assuming that disk space is
available) not major trauma. If the pointers and counters in the index
data structures are redeclared to be 6 bytes instead of 4 bytes, for
example, it should be possible to handle up to 256 TB of text in theory.
Index file overhead will go up to about 120% instead of the current 80%
of the database text size. At this point, some simple compression
routines might be worth exploring to increase storage efficiency, if it
can be done without slowing down the retrieval process.
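The 6-byte pointer idea can be sketched as below (the helper names
are mine, for illustration): a 48-bit offset packed into six bytes
addresses 2^48 bytes = 256 TB, at 1.5 times the space of the current
4-byte pointers, which is where the rise from about 80% to about 120%
overhead comes from.

```c
/* Sketch: pack a 48-bit file offset into six bytes. */
typedef unsigned char byte6[6];

/* store a 48-bit offset, least significant byte first */
void pack48(byte6 p, unsigned long long off)
{
    for (int i = 0; i < 6; i++) {
        p[i] = (unsigned char)(off & 0xFF);
        off >>= 8;
    }
}

/* recover the offset from its six-byte form */
unsigned long long unpack48(const byte6 p)
{
    unsigned long long off = 0;
    for (int i = 5; i >= 0; i--)
        off = (off << 8) | p[i];
    return off;
}
```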
Proximity subset searching should still be straightforward to do using
my simple vector representation of the database, but as files get bigger
it may be necessary to hold the large subset vectors on disk and page
them into memory only as needed. Modern operating systems with large
virtual memory spaces should be able to handle that for the program
automatically. If I keep the current default 32-byte quantization, then
the subset vectors will still be only 0.4% as big as the total database
text size, and so even in the multigigabyte zone each subset will only
require a few megabytes of space.
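The subset-vector representation can be sketched as a plain bit
vector, one bit per 32-byte chunk of database text, which gives the
1/256 (about 0.4%) space figure quoted above. Function names here are
illustrative, not taken from the actual programs:

```c
/* Sketch: subset vector with 32-byte quantization. */
#define QUANTUM 32  /* bytes of text covered by one subset bit */

/* mark the chunk containing byte offset `off` as in the subset */
void subset_set(unsigned char *vec, unsigned long off)
{
    unsigned long chunk = off / QUANTUM;
    vec[chunk >> 3] |= (unsigned char)(1 << (chunk & 7));
}

/* test whether the chunk containing `off` is in the subset */
int subset_test(const unsigned char *vec, unsigned long off)
{
    unsigned long chunk = off / QUANTUM;
    return (vec[chunk >> 3] >> (chunk & 7)) & 1;
}
```

Because the vector is just an array of bytes, it pages in and out of
virtual memory naturally, as the paragraph above anticipates.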
CONCLUSIONS
My bottom-line evaluation is that a free-text IR system such as I have
built, plus anticipated extensions, will surely break down -- but not
until reaching file sizes in the 10 GB or larger range, or with
information arriving at a rate greater than 1 GB of text per day. Then,
problems with data transfer rates and with indexing speed may force one
to find alternative solutions. Probably a multiprocessor approach using
a partitioned database is the best tactic to take at that point.
Meanwhile, I see a lot of value still to be derived from my real-time
high-bandwidth free-text information retrieval tools, particularly as
the costs of data storage continue to decline.
/* end of part 2 of 2 */